{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "e3c3f525-27c0-4b3f-a0fc-56abf27d22e6"
   },
   "source": [
    "## 一个简单Word2Vec的例子\n",
    "\n",
    "这个例子展示了如何利用DSW进行构造一个Word2Vec的学习，Word2Vec是NLP训练一个比较基础的数据处理方式，具体原理大家可以仔细阅读论文《[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)》\n",
    "\n",
    "我们准备一个文章text，然后通过这个文章的信息来学习出文章中单词的一个向量表达，使得单词的空间距离能够体现单词语义距离，使得越靠近的单词语义越相似。\n",
    "\n",
    "首先我们把文章读入到words这个数组中。可以看到这个文章总共包括17005207单词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "uuid": "0d3bdd3e-1d53-4ca7-9d46-c873ebcaa356"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "words size 17005207\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "import zipfile\n",
    "\n",
    "with zipfile.ZipFile(\"text.zip\") as f:\n",
    "    words = tf.compat.as_str(f.read(f.namelist()[0])).split()\n",
    "    \n",
    "print('words size', len(words))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "7d34c1bf-6eca-4490-88d9-b9a85825e791"
   },
   "source": [
    "然后我们准备一个词典，例子中我们限制一下这个词典最大size为50000, 字典中我们保存文章最多出现49999个词，然后其他词都当做'UNK'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "uuid": "39833c9f-1f62-4400-9b19-e90a6a445e39"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "最多5个单词以及出现次数 [('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201)]\n"
     ]
    }
   ],
   "source": [
    "import collections\n",
    "import math\n",
    "\n",
    "vocabulary_size = 50000\n",
    "count = [['UNK', -1]]\n",
    "count.extend(collections.Counter(words).most_common(vocabulary_size - 1))\n",
    "print(\"最多5个单词以及出现次数\", count[1:6])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "uuid": "3f6055b5-da24-435b-b3cd-dc1dbd8406c3"
   },
   "outputs": [],
   "source": [
    "为了后面训练的方便，我们把单词用字典的index来进行标识，并且把原文用这个进行编码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "uuid": "02cb50f5-baee-4777-b494-cba47a56f24a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "编码后文章为 [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ...\n"
     ]
    }
   ],
   "source": [
    "dictionary = dict()\n",
    "for word, _ in count:\n",
    "    dictionary[word] = len(dictionary)\n",
    "data = list()\n",
    "unk_count = 0\n",
    "for word in words:\n",
    "    if word in dictionary:\n",
    "        index = dictionary[word]\n",
    "    else:\n",
    "        index = 0  # dictionary['UNK']\n",
    "    unk_count += 1\n",
    "    data.append(index)\n",
    "count[0][1] = unk_count\n",
    "\n",
    "print('编码后文章为', data[:10], '...')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "fa7845e0-aa02-44f4-93b8-7d3f4d8270fa"
   },
   "source": [
    "建立一个方向查找表，等学习完，可以把单词的编码又变回到原单词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "uuid": "de002ada-8550-4708-9c5c-f9404f127447"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']\n"
     ]
    }
   ],
   "source": [
    "reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
    "print([reverse_dictionary[i] for i in data[:10]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "74367286-10c7-4ada-aaa9-553b9f5f7499"
   },
   "source": [
    "接下来我们进入正题，我们构造训练word2vec的样本，也就是skip gram论文中的方法，把这个定义为一个函数，留到后面训练时候用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "uuid": "69600e9c-cf69-4914-9337-fd0a603eee34"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "data_index = 0\n",
    "def generate_batch(batch_size, num_skips, skip_window):\n",
    "    global data_index\n",
    "    batch = np.ndarray(shape=(batch_size), dtype=np.int32)\n",
    "    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)\n",
    "    span = 2 * skip_window + 1 # [ skip_window target skip_window ]\n",
    "    buffer = collections.deque(maxlen=span)\n",
    "    for _ in range(span):\n",
    "        buffer.append(data[data_index])\n",
    "        data_index = (data_index + 1) % len(data)\n",
    "    for i in range(batch_size // num_skips):\n",
    "        target = skip_window  # target label at the center of the buffer\n",
    "        targets_to_avoid = [ skip_window ]\n",
    "        for j in range(num_skips):\n",
    "            while target in targets_to_avoid:\n",
    "                target = random.randint(0, span - 1)\n",
    "            targets_to_avoid.append(target)\n",
    "            batch[i * num_skips + j] = buffer[skip_window]\n",
    "            labels[i * num_skips + j, 0] = buffer[target]\n",
    "        buffer.append(data[data_index])\n",
    "        data_index = (data_index + 1) % len(data)\n",
    "    return batch, labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "0faa5a27-324b-44bc-b1db-b3fa8f7f401e"
   },
   "source": [
    "我们观察这个训练样本的样子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "uuid": "76581287-f85a-4607-a200-f6869a6a6f52"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3081 originated -> 12 as\n",
      "3081 originated -> 5234 anarchism\n",
      "12 as -> 3081 originated\n",
      "12 as -> 6 a\n",
      "6 a -> 12 as\n",
      "6 a -> 195 term\n",
      "195 term -> 2 of\n",
      "195 term -> 6 a\n"
     ]
    }
   ],
   "source": [
    "batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)\n",
    "for i in range(8):\n",
    "    print(batch[i], reverse_dictionary[batch[i]],\n",
    "        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "d59134c0-dce7-4c1a-a409-5c6e68df9137"
   },
   "source": [
    "现在我们来定义训练说需要的DNN模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "uuid": "86b87d7f-8b6f-4ead-a94e-6b594e1d29c5"
   },
   "outputs": [],
   "source": [
    "batch_size = 128\n",
    "embedding_size = 128  # Dimension of the embedding vector.\n",
    "skip_window = 1       # How many words to consider left and right.\n",
    "num_skips = 2         # How many times to reuse an input to generate a label.\n",
    "\n",
    "# We pick a random validation set to sample nearest neighbors. Here we limit the\n",
    "# validation samples to the words that have a low numeric ID, which by\n",
    "# construction are also the most frequent.\n",
    "valid_size = 16     # Random set of words to evaluate similarity on.\n",
    "valid_window = 100  # Only pick dev samples in the head of the distribution.\n",
    "valid_examples = np.random.choice(valid_window, valid_size, replace=False)\n",
    "num_sampled = 64    # Number of negative examples to sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "uuid": "74303189-a5ba-4287-9ab2-098508469166"
   },
   "outputs": [],
   "source": [
    "train_inputs = tf.placeholder(tf.int32, shape=[batch_size])\n",
    "train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])\n",
    "valid_dataset = tf.constant(valid_examples, dtype=tf.int32)\n",
    "\n",
    "embeddings = tf.Variable(\n",
    "    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))\n",
    "embed = tf.nn.embedding_lookup(embeddings, train_inputs)\n",
    "\n",
    "nce_weights = tf.Variable(\n",
    "    tf.truncated_normal([vocabulary_size, embedding_size],\n",
    "                            stddev=1.0 / math.sqrt(embedding_size)))\n",
    "nce_biases = tf.Variable(tf.zeros([vocabulary_size]))\n",
    "\n",
    "loss = tf.reduce_mean(\n",
    "    tf.nn.nce_loss(weights=nce_weights,\n",
    "                     biases=nce_biases,\n",
    "                     labels=train_labels,\n",
    "                     inputs=embed,\n",
    "                     num_sampled=num_sampled,\n",
    "                     num_classes=vocabulary_size))\n",
    "\n",
    "optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)\n",
    "\n",
    "norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))\n",
    "normalized_embeddings = embeddings / norm\n",
    "valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)\n",
    "similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "uuid": "c1794e27-208c-48a4-977d-ee2100e2c603"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/admin/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).\n",
      "  warnings.warn('An interactive session is already active. This can '\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average loss at step  0 :  293.93377685546875\n",
      "Nearest to more: ona, conquistador, loa, vichy, escoffier, miss, humber, gulliver,\n",
      "Nearest to six: degrees, californian, ascertaining, internal, undecidable, lease, meteorologists, terabytes,\n",
      "Nearest to be: jupiter, surinam, detritus, involves, spongebob, cluniac, mulholland, postgraduate,\n",
      "Nearest to known: noticing, senators, scripture, hahn, demolished, castilian, modernity, wiser,\n",
      "Nearest to two: gions, composing, flames, foolish, tempting, ali, silt, captive,\n",
      "Nearest to use: passport, morgue, uma, messiaen, scrap, studio, excelled, garcia,\n",
      "Nearest to for: patricio, klerk, puritanical, modernity, arminius, belgica, lenovo, pentagonal,\n",
      "Nearest to other: marianas, aarseth, legates, justus, confocal, exe, indented, rockwell,\n",
      "Nearest to first: ingredients, massing, bigelow, disarm, allein, discouraged, cygnus, greenberg,\n",
      "Nearest to from: batter, clarified, unpaired, neurological, monetarism, shoe, imprisoned, affair,\n",
      "Nearest to after: rightist, diz, formulaic, police, apap, nanotubes, samaria, enters,\n",
      "Nearest to there: oxidise, uttar, ndebele, quotient, newsgroups, larry, omnivores, bites,\n",
      "Nearest to years: vasco, outlined, bashar, aramaean, crashed, muscle, coinciding, lombardy,\n",
      "Nearest to between: emphasised, quadrature, alcibiades, muted, symbolist, traps, spokes, cordell,\n",
      "Nearest to nine: symbolize, sacramento, zenith, yad, hemp, repeaters, syst, tactician,\n",
      "Nearest to so: hab, winston, tendons, ethnic, signatories, wrecking, transportation, training,\n",
      "Average loss at step  2000 :  113.83043353939057\n",
      "Average loss at step  4000 :  52.04903098034859\n",
      "Average loss at step  6000 :  33.55314987397194\n",
      "Average loss at step  8000 :  23.63758887195587\n",
      "Average loss at step  10000 :  18.26611834061146\n",
      "Nearest to more: miss, refrain, located, icelandic, abbots, austin, vs, basins,\n",
      "Nearest to six: zero, nine, austin, five, gollancz, altenberg, vaccination, vs,\n",
      "Nearest to be: gland, jupiter, six, have, parallels, involves, inequality, died,\n",
      "Nearest to known: sign, modernity, aspiration, analogue, senators, cheap, converter, archive,\n",
      "Nearest to two: one, victoriae, gland, austin, nine, reginae, alpina, afghani,\n",
      "Nearest to use: reginae, seed, passport, victoriae, anatole, austin, studio, aluminium,\n",
      "Nearest to for: and, of, with, in, by, sanjaks, to, as,\n",
      "Nearest to other: austin, hit, phylum, dreyfus, roman, reginae, households, indented,\n",
      "Nearest to first: austin, discouraged, mantle, version, genocide, airport, vaccination, barrel,\n",
      "Nearest to from: in, of, vs, and, to, tanto, neurological, near,\n",
      "Nearest to after: two, across, one, extremely, police, egg, reginae, ludwig,\n",
      "Nearest to there: larry, easily, buddhism, buried, means, history, assigned, persecute,\n",
      "Nearest to years: painter, zero, detector, outlined, principal, colspan, analogue, anarchism,\n",
      "Nearest to between: emphasised, in, and, vaccination, nine, of, sa, resident,\n",
      "Nearest to nine: zero, austin, victoriae, reginae, alpina, six, implicit, vs,\n",
      "Nearest to so: one, ethnic, transportation, winston, territories, competition, austin, chloride,\n",
      "Average loss at step  12000 :  14.036011875867844\n",
      "Average loss at step  14000 :  11.746295873045922\n",
      "Average loss at step  16000 :  9.960373474597931\n",
      "Average loss at step  18000 :  8.668071280956267\n",
      "Average loss at step  20000 :  7.740298974275589\n",
      "Nearest to more: absalom, refrain, miss, located, hebron, abbots, icelandic, intended,\n",
      "Nearest to six: nine, eight, five, seven, zero, dasyprocta, two, four,\n",
      "Nearest to be: have, by, parallels, six, was, make, as, veil,\n",
      "Nearest to known: aspiration, noticing, cheap, demolished, lahore, modernity, dexter, atkinson,\n",
      "Nearest to two: five, three, one, seven, six, eight, dasyprocta, four,\n",
      "Nearest to use: seed, dasyprocta, backslash, reginae, passport, anatole, uma, homomorphism,\n",
      "Nearest to for: in, with, of, and, by, to, as, dasyprocta,\n",
      "Nearest to other: indented, households, nn, hit, multiple, roman, austin, edgar,\n",
      "Nearest to first: backslash, austin, for, discouraged, vaccination, bigelow, numa, mantle,\n",
      "Nearest to from: in, of, to, and, at, vs, for, agouti,\n",
      "Nearest to after: dasyprocta, across, two, and, in, with, as, heterosexual,\n",
      "Nearest to there: it, they, which, he, easily, larry, buried, newsgroups,\n",
      "Nearest to years: two, lombardy, detector, outlined, dasyprocta, gather, imran, painter,\n",
      "Nearest to between: in, and, dasyprocta, of, vaccination, emphasised, quadrature, seven,\n",
      "Nearest to nine: eight, seven, six, zero, dasyprocta, five, agouti, four,\n",
      "Nearest to so: winston, ethnic, endeavour, signatories, transportation, territories, competition, defended,\n",
      "Average loss at step  22000 :  7.200594078540802\n",
      "Average loss at step  24000 :  6.964842308521271\n",
      "Average loss at step  26000 :  6.659653395533562\n",
      "Average loss at step  28000 :  6.239654780030251\n",
      "Average loss at step  30000 :  6.1383238241672515\n",
      "Nearest to more: absalom, refrain, ona, hebron, humber, intended, abbots, miss,\n",
      "Nearest to six: eight, five, nine, seven, four, three, zero, two,\n",
      "Nearest to be: have, by, was, parallels, make, as, is, veil,\n",
      "Nearest to known: reuptake, hamas, aspiration, atkinson, noticing, modernity, cheap, dexter,\n",
      "Nearest to two: four, three, five, one, six, seven, eight, dasyprocta,\n",
      "Nearest to use: passport, seed, dasyprocta, microseconds, backslash, abitibi, risk, positivists,\n",
      "Nearest to for: with, of, in, and, by, to, from, dasyprocta,\n",
      "Nearest to other: reuptake, ligature, rockwell, nn, austin, multiple, households, hit,\n",
      "Nearest to first: backslash, austin, numa, potsdam, vaccination, discouraged, on, for,\n",
      "Nearest to from: in, and, of, vs, at, for, on, amalthea,\n",
      "Nearest to after: abitibi, two, dasyprocta, in, and, across, with, as,\n",
      "Nearest to there: it, they, he, which, easily, buried, now, larry,\n",
      "Nearest to years: lombardy, two, autocad, detector, gather, outlined, crashed, colspan,\n",
      "Nearest to between: and, in, of, with, dasyprocta, on, two, emphasised,\n",
      "Nearest to nine: eight, seven, six, five, four, three, zero, dasyprocta,\n",
      "Nearest to so: winston, endeavour, amalthea, ethnic, signatories, advantages, territories, ending,\n",
      "Average loss at step  32000 :  5.876646659016609\n",
      "Average loss at step  34000 :  5.817600245714187\n",
      "Average loss at step  36000 :  5.715285798311234\n",
      "Average loss at step  38000 :  5.253283932566643\n",
      "Average loss at step  40000 :  5.463549421310425\n",
      "Nearest to more: absalom, not, ops, refrain, intended, roskilde, library, hebron,\n",
      "Nearest to six: seven, four, eight, five, three, nine, zero, two,\n",
      "Nearest to be: have, by, was, is, parallels, make, been, were,\n",
      "Nearest to known: noticing, reuptake, modernity, atkinson, rakyat, hamas, used, exploding,\n",
      "Nearest to two: three, four, six, five, one, seven, eight, dasyprocta,\n",
      "Nearest to use: passport, dasyprocta, backslash, microseconds, risk, abitibi, uma, homomorphism,\n",
      "Nearest to for: with, in, of, to, from, dasyprocta, by, and,\n",
      "Nearest to other: reuptake, ligature, two, multiple, rockwell, six, austin, nn,\n",
      "Nearest to first: backslash, austin, potsdam, numa, lerner, vaccination, lipids, hus,\n",
      "Nearest to from: in, vs, on, agouti, at, through, of, amalthea,\n",
      "Nearest to after: abitibi, dasyprocta, and, with, as, across, two, in,\n",
      "Nearest to there: they, it, he, which, easily, now, philanthropy, buried,\n",
      "Nearest to years: lombardy, two, six, autocad, detector, one, gather, vasco,\n",
      "Nearest to between: in, with, and, recitative, from, dasyprocta, to, on,\n",
      "Nearest to nine: eight, seven, six, zero, five, three, four, dasyprocta,\n",
      "Nearest to so: winston, endeavour, amalthea, signatories, ethnic, advantages, fidonet, territories,\n",
      "Average loss at step  42000 :  5.311039533734322\n",
      "Average loss at step  44000 :  5.285898304700852\n",
      "Average loss at step  46000 :  5.267841118693352\n",
      "Average loss at step  48000 :  5.037707761883736\n",
      "Average loss at step  50000 :  5.153288539290428\n",
      "Nearest to more: absalom, not, less, trolleybus, roskilde, ops, refrain, intended,\n",
      "Nearest to six: eight, four, seven, five, three, nine, two, one,\n",
      "Nearest to be: have, by, was, is, were, make, been, are,\n",
      "Nearest to known: noticing, reuptake, used, atkinson, modernity, rakyat, exploding, hahn,\n",
      "Nearest to two: three, one, four, six, five, eight, seven, dasyprocta,\n",
      "Nearest to use: risk, dasyprocta, passport, backslash, microseconds, abitibi, reginae, seed,\n",
      "Nearest to for: in, with, and, of, from, dasyprocta, to, against,\n",
      "Nearest to other: reuptake, multiple, ligature, austin, two, rockwell, certain, nn,\n",
      "Nearest to first: backslash, bigelow, potsdam, numa, austin, lerner, abitibi, word,\n",
      "Nearest to from: in, through, at, vs, on, agouti, and, into,\n",
      "Nearest to after: abitibi, prism, dasyprocta, as, in, and, three, when,\n",
      "Nearest to there: it, they, he, which, easily, now, philanthropy, buried,\n",
      "Nearest to years: lombardy, two, aramaean, autocad, gather, detector, outlined, colspan,\n",
      "Nearest to between: with, in, dasyprocta, and, from, recitative, vaccination, everywhere,\n",
      "Nearest to nine: eight, seven, six, zero, four, three, five, agouti,\n",
      "Nearest to so: winston, endeavour, signatories, amalthea, ethnic, advantages, fidonet, territories,\n",
      "Average loss at step  52000 :  5.159958806276322\n",
      "Average loss at step  54000 :  5.105936750173568\n",
      "Average loss at step  56000 :  5.042992538094521\n",
      "Average loss at step  58000 :  5.137826075315475\n",
      "Average loss at step  60000 :  4.93975215446949\n",
      "Nearest to more: less, absalom, microsite, not, roskilde, quantifiers, trolleybus, microscopy,\n",
      "Nearest to six: eight, five, four, seven, nine, three, zero, dasyprocta,\n",
      "Nearest to be: have, by, been, was, were, is, make, refer,\n",
      "Nearest to known: used, noticing, modernity, reuptake, atkinson, rakyat, exploding, hahn,\n",
      "Nearest to two: three, four, one, five, six, seven, eight, dasyprocta,\n",
      "Nearest to use: risk, passport, dasyprocta, microseconds, backslash, callithrix, abitibi, substituting,\n",
      "Nearest to for: of, in, or, and, with, to, dasyprocta, against,\n",
      "Nearest to other: reuptake, multiple, xhtml, conic, cebus, rockwell, ligature, michelob,\n",
      "Nearest to first: backslash, bigelow, numa, potsdam, austin, tamarin, lerner, heavy,\n",
      "Nearest to from: in, through, into, at, vs, agouti, and, amalthea,\n",
      "Nearest to after: in, as, before, prism, when, abitibi, dasyprocta, marmoset,\n",
      "Nearest to there: they, it, he, which, easily, now, cebus, hardly,\n",
      "Nearest to years: vasco, lombardy, autocad, aramaean, four, callithrix, outlined, detector,\n",
      "Nearest to between: with, in, everywhere, vaccination, recitative, dasyprocta, from, on,\n",
      "Nearest to nine: eight, six, seven, five, four, zero, three, callithrix,\n",
      "Nearest to so: endeavour, tamarin, winston, advantages, ethnic, amalthea, msg, fidonet,\n",
      "Average loss at step  62000 :  4.812170468568802\n",
      "Average loss at step  64000 :  4.798992517709732\n",
      "Average loss at step  66000 :  4.980687629342079\n",
      "Average loss at step  68000 :  4.907456518173218\n",
      "Average loss at step  70000 :  4.795046325325966\n",
      "Nearest to more: less, absalom, roskilde, ona, microsite, quantifiers, most, not,\n",
      "Nearest to six: eight, four, five, seven, three, nine, zero, two,\n",
      "Nearest to be: been, have, by, were, is, are, refer, was,\n",
      "Nearest to known: used, noticing, reuptake, modernity, atkinson, exploding, hahn, rakyat,\n",
      "Nearest to two: three, four, six, one, five, seven, eight, dasyprocta,\n",
      "Nearest to use: risk, passport, dasyprocta, microseconds, callithrix, backslash, abitibi, microcebus,\n",
      "Nearest to for: in, of, and, with, including, or, against, dasyprocta,\n",
      "Nearest to other: reuptake, multiple, cebus, many, xhtml, ligature, different, conic,\n",
      "Nearest to first: bigelow, backslash, thaler, numa, potsdam, austin, heavy, abitibi,\n",
      "Nearest to from: through, in, into, vs, amalthea, during, agouti, on,\n",
      "Nearest to after: before, when, in, prism, abitibi, dasyprocta, marmoset, while,\n",
      "Nearest to there: they, it, he, which, easily, now, cebus, also,\n",
      "Nearest to years: four, lombardy, autocad, aramaean, vasco, six, callithrix, govern,\n",
      "Nearest to between: with, in, from, everywhere, dasyprocta, vaccination, recitative, around,\n",
      "Nearest to nine: eight, six, seven, five, zero, four, three, callithrix,\n",
      "Nearest to so: endeavour, winston, tamarin, advantages, amalthea, thz, ethnic, fidonet,\n",
      "Average loss at step  72000 :  4.805254952311516\n",
      "Average loss at step  74000 :  4.778901010006666\n",
      "Average loss at step  76000 :  4.864288411319256\n",
      "Average loss at step  78000 :  4.7959640386104585\n",
      "Average loss at step  80000 :  4.8224020563364025\n",
      "Nearest to more: less, most, absalom, roskilde, microsite, ona, not, quantifiers,\n",
      "Nearest to six: five, four, seven, eight, three, nine, two, zero,\n",
      "Nearest to be: have, been, by, was, were, are, refer, is,\n",
      "Nearest to known: used, noticing, reuptake, modernity, hahn, atkinson, exploding, microscope,\n",
      "Nearest to two: three, four, six, five, seven, one, eight, callithrix,\n",
      "Nearest to use: risk, passport, dasyprocta, microseconds, callithrix, backslash, microcebus, substituting,\n",
      "Nearest to for: dasyprocta, or, with, in, against, cebus, patricio, primigenius,\n",
      "Nearest to other: many, xhtml, reuptake, cebus, multiple, different, these, some,\n",
      "Nearest to first: bigelow, backslash, second, thaler, numa, potsdam, latter, austin,\n",
      "Nearest to from: through, into, in, on, during, amalthea, vs, at,\n",
      "Nearest to after: before, when, prism, in, during, abitibi, dasyprocta, marmoset,\n",
      "Nearest to there: they, it, he, which, now, easily, cebus, hardly,\n",
      "Nearest to years: six, autocad, aramaean, lombardy, vasco, callithrix, govern, coinciding,\n",
      "Nearest to between: with, in, recitative, from, dasyprocta, vaccination, everywhere, emphasised,\n",
      "Nearest to nine: eight, seven, six, five, four, zero, three, callithrix,\n",
      "Nearest to so: endeavour, advantages, tamarin, amalthea, winston, msg, fidonet, thz,\n",
      "Average loss at step  82000 :  4.811216924786567\n",
      "Average loss at step  84000 :  4.795806077718734\n",
      "Average loss at step  86000 :  4.753829827547073\n",
      "Average loss at step  88000 :  4.6853781598806385\n",
      "Average loss at step  90000 :  4.764363196372986\n",
      "Nearest to more: less, most, roskilde, absalom, ona, not, microsite, bengali,\n",
      "Nearest to six: eight, five, seven, four, nine, three, two, zero,\n",
      "Nearest to be: been, have, was, were, are, by, refer, is,\n",
      "Nearest to known: used, noticing, modernity, reuptake, exploding, hahn, atkinson, contrasting,\n",
      "Nearest to two: three, four, five, six, one, seven, eight, dasyprocta,\n",
      "Nearest to use: risk, passport, morgue, dasyprocta, substituting, backslash, callithrix, microseconds,\n",
      "Nearest to for: of, with, or, including, dasyprocta, patricio, cebus, microcebus,\n",
      "Nearest to other: xhtml, many, different, linebarger, multiple, reuptake, cebus, some,\n",
      "Nearest to first: backslash, second, bigelow, thaler, numa, latter, tamarin, potsdam,\n",
      "Nearest to from: through, into, in, amalthea, during, on, at, agouti,\n",
      "Nearest to after: before, when, during, prism, abitibi, dasyprocta, until, while,\n",
      "Nearest to there: they, it, he, which, now, cebus, easily, hardly,\n",
      "Nearest to years: autocad, lombardy, six, aramaean, five, months, vasco, coinciding,\n",
      "Nearest to between: with, in, from, dasyprocta, vaccination, recitative, everywhere, emphasised,\n",
      "Nearest to nine: eight, seven, six, five, four, zero, callithrix, dasyprocta,\n",
      "Nearest to so: advantages, endeavour, amalthea, winston, fidonet, msg, disabled, believe,\n",
      "Average loss at step  92000 :  4.711258682847023\n",
      "Average loss at step  94000 :  4.632245309472084\n",
      "Average loss at step  96000 :  4.737491478085518\n",
      "Average loss at step  98000 :  4.6130690822601315\n",
      "Average loss at step  100000 :  4.670656157255173\n",
      "Nearest to more: less, most, roskilde, absalom, microsite, ona, olympias, bengali,\n",
      "Nearest to six: seven, eight, five, four, nine, three, two, zero,\n",
      "Nearest to be: been, have, are, by, were, was, is, refer,\n",
      "Nearest to known: used, noticing, modernity, reuptake, called, tamarin, genuine, microscope,\n",
      "Nearest to two: four, three, five, six, seven, one, eight, callithrix,\n",
      "Nearest to use: risk, morgue, passport, dasyprocta, substituting, microseconds, microcebus, callithrix,\n",
      "Nearest to for: or, with, dasyprocta, in, during, and, cebus, including,\n",
      "Nearest to other: xhtml, many, linebarger, cebus, various, reuptake, different, multiple,\n",
      "Nearest to first: second, backslash, thaler, bigelow, numa, latter, next, under,\n",
      "Nearest to from: through, in, into, during, at, amalthea, on, agouti,\n",
      "Nearest to after: before, when, during, abitibi, prism, in, dasyprocta, until,\n",
      "Nearest to there: they, it, he, now, which, cebus, hardly, still,\n",
      "Nearest to years: six, autocad, aramaean, lombardy, months, days, bokassa, callithrix,\n",
      "Nearest to between: with, in, from, everywhere, vaccination, pleated, of, around,\n",
      "Nearest to nine: eight, seven, six, five, four, zero, three, callithrix,\n",
      "Nearest to so: advantages, endeavour, msg, believe, amalthea, tamarin, aon, winston,\n"
     ]
    }
   ],
   "source": [
    "sess = tf.InteractiveSession()\n",
    "tf.global_variables_initializer().run()\n",
    "\n",
    "average_loss = 0\n",
    "for step in range(100001):\n",
    "    batch_inputs, batch_labels = generate_batch(\n",
    "        batch_size, num_skips, skip_window)\n",
    "    feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}\n",
    "\n",
    "    # We perform one update step by evaluating the optimizer op (including it\n",
    "    # in the list of returned values for session.run()\n",
    "    _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)\n",
    "    average_loss += loss_val\n",
    "\n",
    "    if step % 2000 == 0:\n",
    "        if step > 0:\n",
    "            average_loss /= 2000\n",
    "        print(\"Average loss at step \", step, \": \", average_loss)\n",
    "        average_loss = 0\n",
    "\n",
    "    if step % 10000 == 0:\n",
    "        sim = similarity.eval()\n",
    "        for i in range(valid_size):\n",
    "            valid_word = reverse_dictionary[valid_examples[i]]\n",
    "            top_k = 8 # number of nearest neighbors\n",
    "            nearest = (-sim[i, :]).argsort()[1:top_k+1]\n",
    "            log_str = \"Nearest to %s:\" % valid_word\n",
    "            for k in range(top_k):\n",
    "                close_word = reverse_dictionary[nearest[k]]\n",
    "                log_str = \"%s %s,\" % (log_str, close_word)\n",
    "            print(log_str)\n",
    "final_embeddings = normalized_embeddings.eval()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "b4596fc6-47db-4df4-acf0-66256be341df"
   },
   "source": [
    "最后我们形象在画布上展示学习出来单词的向量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "uuid": "45e745b3-3307-4476-844c-dbdc475bae06"
   },
   "outputs": [],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "import matplotlib.pyplot as plt\n",
    "tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)\n",
    "plot_only = 200\n",
    "low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])\n",
    "labels = [reverse_dictionary[i] for i in range(plot_only)]\n",
    "\n",
    "plt.figure(figsize=(18, 18))  #in inches\n",
    "for i, label in enumerate(labels):\n",
    "    x, y = low_dim_embs[i,:]\n",
    "    plt.scatter(x, y)\n",
    "    plt.annotate(label,\n",
    "                 xy=(x, y),\n",
    "                 xytext=(5, 2),\n",
    "                 textcoords='offset points',\n",
    "                 ha='right',\n",
    "                 va='bottom')\n",
    "\n",
    "plt.savefig('result.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellType": "markdown",
    "uuid": "9d08d100-c415-4edf-a028-c3825dfb5379"
   },
   "source": [
    "![result](./result.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "uuid": "f65459f9-cd50-4113-a823-dd708f732bbd"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
